BOVIDS: A deep learning‐based software package for pose estimation to evaluate nightly behavior and its application to common elands (Tragelaphus oryx) in zoos

Abstract Only a few studies on the nocturnal behavior of African ungulates exist so far, with mostly small sample sizes. For a comprehensive understanding of nocturnal behavior, the data basis needs to be expanded. Results obtained by observing zoo animals can provide clues for the study of wild animals and furthermore contribute to a better understanding of animal welfare and better husbandry conditions in zoos. The current contribution reduces the lack of data in two ways. First, we present a stand‐alone open‐source software package based on deep learning techniques, named Behavioral Observations by Videos and Images using Deep‐Learning Software (BOVIDS). It can be used to identify ungulates in their enclosure and to determine the three behavioral poses “Standing,” “Lying—head up,” and “Lying—head down” on 11,411 h of video material with an accuracy of 99.4%. Second, BOVIDS is used to conduct a case study on 25 common elands (Tragelaphus oryx) out of 5 EAZA zoos with a total of 822 nights, yielding the first detailed description of the nightly behavior of common elands. Our results indicate that age and sex are influencing factors on the nocturnal activity budget, the length of behavioral phases as well as the number of phases per behavioral state during the night while the keeping zoo has no significant influence. It is found that males spend more time in REM sleep posture than females while young animals spend more time in this position than adult ones. Finally, the results suggest a rhythm between the Standing and Lying phases among common elands that opens future research directions.


| General
The nocturnal behavior of many African mammals is poorly studied.
It is known that behavioral patterns can vary greatly between day and night: many large herbivorous mammals spend most of their sleeping time during the night, especially in winter, while activity occurs primarily during the day (Bennie et al., 2014; Davimes et al., 2018; Gravett et al., 2017; Wu et al., 2018). For a comprehensive understanding of diurnal rhythms, a behavioral description of the entire diurnal cycle is necessary. Currently, only a few contributions study nocturnal behavior. Zoo animals are much easier to observe at night than animals in their natural habitat because the required equipment is much easier to install (Ryder & Feistner, 1995). In order not to disturb the animals, camera recordings are a good means of data collection in this case. Data collected in zoos can be valuable for studying animals' behavior. In various species, no differences are found between the behavior of animals in the wild and in captivity (Hollén & Manser, 2007; Melfi & Feistner, 2002). This was verified recently for basic nocturnal activities like being in the REM sleep position between giraffes in zoos and in the wild (Burger et al., 2020). Therefore, studies conducted in zoos can provide a good basis for describing the animals' nocturnal behavior and the obtained results can subsequently serve as starting information for observations in the field (Burger et al., 2020). In addition, a deeper knowledge of nocturnal behavior inside zoo enclosures could contribute information to further improve animal management and husbandry in zoos (Brando & Buchanan-Smith, 2018) and provide conclusions on animal welfare (Walsh et al., 2019). One explicit example is that REM sleep appears to be an important indicator of stress in giraffes (Sicks, 2016), which can be measured by noninvasive methods.
To describe nocturnal behavior unambiguously, reliable data are needed, especially because there are few comparisons in literature.
This means that it would be preferable to observe multiple individuals of a species over a longer period of time to accurately describe the average behavior. Additionally, it is necessary to obtain data not only on one but on various species to close the existing knowledge gap. The extraction of meaningful information as well as a detailed evaluation of a mass of recorded data requires modern techniques to automate parts of this data mining process (Beery et al., 2020; Lürig et al., 2021; Norouzzadeh et al., 2018). In the last decade, various computer vision and deep learning techniques found their way into behavioral biology and ecology (Chakravarty et al., 2020; Dell et al., 2014; Eikelboom et al., 2019; Gerovichev et al., 2021; Norouzzadeh et al., 2021; Schneider et al., 2018, 2020; Valletta et al., 2017), facilitating the task of dealing with a large dataset. Unfortunately, automating the evaluation of video recordings is challenging if the recordings suffer from a very low frame rate (lower than 5 fps), much background noise, or heavy truncation effects, as is usual in observations in stables such as zoo enclosures, or even in installations in the wild. More precisely, background noise appearing in such recordings is, for instance, due to light reflections caused by infrared emitters and particulate matter caused by the hay, while truncation and occlusion effects appear if the camera is not able to capture the whole enclosure or there are multiple overlapping animals in one stable. It should be emphasized that these negative effects become stronger the more general the setup is. Systems for automatic detection of flies or mice under perfect laboratory conditions (Graving et al., 2019; Kabra et al., 2013; Pereira et al., 2020) need to be much less robust to such effects than the system at hand for enclosures and stables.
Of course, installations in the wild, like camera-trap studies, must deal with even more noise and truncation.

| Our contribution
One of the two main objectives of this work tackles this challenge by making BOVIDS (Behavioral Observations by Videos and Images using Deep-Learning Software) available, which is a stand-alone software package based on deep learning techniques. To the best of our knowledge, this is the first fully open-source software package tackling the task of evaluating the nocturnal behavior of stalled animals that contains functionalities required for data preparation, training of the deep learning parts, data prediction, and data presentation. More precisely, BOVIDS can be used to evaluate video recordings of stalled ungulates recorded at 1 fps regarding two classification tasks: "binary classification" (a two-class classification task) and "total classification" (a three-class classification task), which are defined by Hahn-Klimroth et al. (2021) as follows.
First, if an animal is not present on an image, the desired label is Out (being out of view) in both tasks. Second, in the total classification task, the three postures Standing, Lying-head up (LHU), and Lying-head down (LHD) need to be distinguished; they are defined in Section 2.2. The binary classification task asks only to distinguish Standing and Lying (combining LHU and LHD) if the animal is present. All discussed software as well as detailed instructions can be found in our GitHub repository: https://github.com/Klimroth/BOVIDS and on Zenodo (https://doi.org/10.5281/zenodo.6143896).
As a second part of the paper, a case study is conducted that explains how BOVIDS can be applied by behavioral biologists to their own data and which statistical analyses can be directly conducted on the output of the software package. In this study, the nocturnal activity budget of 25 common elands is analyzed. To the best of our knowledge, the case study provides the first description of the nocturnal behavior of common elands. Over 11,000 h (822 nights) of video material from five different EAZA zoos were evaluated, a task that seems challenging in the absence of automatic evaluation. It is described in detail how BOVIDS can be used to observe and analyze several important behavioral biological key figures of nocturnal activity. The results contain activity budgets, which show the percentages of all examined behavioral states, a visualization of the Standing-Lying rhythm, and an analysis of the possible influencing factors age, sex, and zoo husbandry.

| Related work
As mentioned earlier, several computational systems have found their way into behavioral biology and ecology (Chakravarty et al., 2020; Dell et al., 2014; Eikelboom et al., 2019; Norouzzadeh et al., 2021; Valletta et al., 2017). Such systems are explicitly designed with respect to the underlying data. In the easiest tasks, cameras can be installed in a laboratory such that the recordings feature a high contrast between animals and the background as well as other laboratory conditions like a given steady camera angle and low background noise. Examples of such systems working with data of Drosophila flies or mice are JAABA (Kabra et al., 2013), DeepBehavior (Graving et al., 2019), and SLEAP (Pereira et al., 2020). When data are recorded either in the natural habitat or in different zoo enclosures, it is much more challenging to collect appropriate data that are amenable to automatic evaluation, for instance due to variations in weather, brightness, and background. Furthermore, different cameras can rarely be adjusted such that the recording angle matches the given requirements or such that animals are not highly truncated. It should be emphasized that there are examples of systems that deal with those challenges. One approach distinguishes the poses "Lying" and "Standing" of cows in free-stall stables under varying brightness conditions (Porto et al., 2013). Furthermore, one success story is the work by Norouzzadeh et al. (2018, 2021), whose system can automatically detect and count different species, as well as some of the behaviors shown, using camera trap images of the Serengeti dataset (Swanson et al., 2015). Similar systems working with camera trap images in the wild are presented by Schneider et al. (2018, 2020).

| MATERIALS AND METHODS
As the purpose of this paper is two-fold, this section is divided into several parts. In the section Data evaluation, methods and material used to collect the data of the case study and to evaluate the findings statistically are presented. Subsequently, the behavioral states of interest are defined properly in section Ethogram, whereas the section Foundations of Deep Learning introduces important concepts of machine learning used by BOVIDS. Finally, the section BOVIDS introduces and describes the single parts of the software package itself in more detail.

| Data evaluation
The dataset includes nights of 25 common elands (Tragelaphus oryx), whereas the number of nights per individual ranges from 15 to 49. In total, 822 nights with 11,411 h of video material are present. The data were collected in winter seasons between 2017 and 2020 in a total of five EAZA zoos in Germany (Allwetterzoo Münster, Erlebnis-Zoo Hannover, Opel-Zoo Kronberg, Zoo Dortmund and Zoom Erlebniswelt Gelsenkirchen). A detailed overview about the used data is given in Table A1. For further analysis the individuals are categorized as follows: "young," ranging from birth until the time of weaning with about six months, "subadult," older than six months until sexual maturity with about two years of age and "adult" afterwards. These categories are chosen according to the information distributed across multiple prior works (Groves & Leslie, 2011;Myers et al., 2021;Puschmann et al., 2009;Tacutu et al., 2013).
All collected data are in the form of video recordings. The cameras used are capable of night vision due to built-in infrared emitters (Lupus LE139HD or Lupus LE338HD with the recording device LUPUSTEC LE800HD or TECHNAXX PRO HD 720P). The recordings are made with a frame rate of 1 fps and the resolution ranges from 704 × 576 px to 1920 × 1080 px. Recording takes place in the stable during night, the time of the absence of animal keepers, which mostly ranges from 17:00 to 07:00 (14 h). In some cases, the recording time is 18:00 to 07:00 (13 h).
The data were recorded continuously, providing an exact time span for every behavior with a start and an end time (Martin & Bateson, 2015). The manual annotation was carried out with the open-source program BORIS, Version 7.7.3 (Friard & Gamba, 2016) and covers 2374 h of video material from 170 nights. BOVIDS requires the use of multiple deep neural networks for object detection (OD) and action classification (AC) as explained by Hahn-Klimroth et al. (2021) and in the following section. To train an initial object detection network, at least 400 images of every enclosure were annotated using LabelImg (Tzutalin, 2015), resulting in 11,326 images of common elands and 49,437 images of various African ungulates as already elaborated by Hahn-Klimroth et al. (2021). Following the prescribed approach, the initial action classification networks were trained not only on 170 recordings (66,466 images) of common elands but also on 113,407 images of other African ungulates with comparable postures. Furthermore, two rounds of offline hard example mining (OHEM) were conducted using an additional 14,381 images of common elands and 50,262 images of other African ungulates.
Finally, the action classifiers used for common elands stalled together were fine-tuned by 24,304 images stemming from manually annotated video files and 7377 images generated through OHEM.
Detailed information can be found in Table A1.
All statistical analyses are conducted with R Studio (R Core Team, 2014) and the figures that are not produced by BOVIDS are created using the core functionalities of R and the package ggplot2 (Wickham, 2016). Statistical tests are performed differently for continuous and ordinal data. A two-factor analysis of variance (ANOVA) on continuous data requires normality, which is tested with the Shapiro-Wilk test for each behavior class. In case of a significant deviation from normality (p < .05), a normality transformation is applied to the data using R's "bestNormalize" package (Peterson & Cavanaugh, 2020). To analyze differences between multiple groups on ordinal data, a Kruskal-Wallis test is applied. Finally, as post hoc tests on all pairs of potentially significant factors, a collection of unpaired t-tests is applied in the continuous case and a collection of Wilcoxon tests in the ordinal case. The alpha level is adjusted by the Bonferroni-Holm adjustment in each case.
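The Bonferroni-Holm adjustment mentioned above can be illustrated with a short sketch. The analyses in the study are run in R; the Python function below is only an illustration of the standard Holm step-down procedure, and the function name `holm_reject` as well as the example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: return one reject flag per hypothesis.

    The p-values are visited in ascending order; the j-th smallest
    (j = 0, 1, ...) is compared with alpha / (m - j). Testing stops at
    the first p-value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are not rejected either
    return reject


if __name__ == "__main__":
    # Hypothetical p-values from four pairwise post hoc tests.
    print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

Compared with the plain Bonferroni correction, which tests every p-value against alpha / m, the step-down thresholds become less strict after each rejection.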

| Ethogram
The focus of this paper is to distinguish between three postures: Standing, Lying-head up (LHU), and Lying-head down (LHD).
Finally, if there is no animal present, the assigned label is out of view (Out). The latter label can also be given if only a small part of the animal is visible, from which the posture cannot be inferred.
Furthermore, the class Lying is defined as the union of LHU and LHD. The binary classification task, which distinguishes Standing, Lying, and Out, makes it possible to analyze rhythms over the night, as the categories "activity" and "rest" are the most prominently measured behavior stages to examine diurnal rhythms (Merrow et al., 2005). In the following ethogram, based on that of Hahn-Klimroth et al. (2021), the three behavioral states are defined and shown in Figure 1.
Standing: The animal stands in an upright position on all four hooves. The exact behavior is neglected, thus the animal could be, for instance, feeding, resting, or ruminating.
Lying-head up (LHU): The animal lies down, and its head is lifted.
The behavioral state does not distinguish if the animal is awake or in non-REM sleep. As before, the precise behavior is neglected.
Lying-head down (LHD): The animal is lying with its head resting on the ground. The head's position is beside the body or sometimes in front of it.
It is crucial to notice that LHD is the typical REM (rapid eye movement) sleep posture. REM sleep is recognized through various behavioral components as the animal is lying with its head resting due to postural atonia (Lima et al., 2005; Zepelin et al., 2005). This characteristic REM sleep position can be used to estimate the REM sleep, a common approach in the study of behavior of common elands (Zizkova et al., 2013) and cows (Ternman et al., 2014).

| Foundations of Deep Learning
In supervised machine learning tasks, one is usually interested in designing a system that allows automatic prediction of new data based on manually annotated examples (Russell & Norvig, 2016). In this contribution, two extensively studied supervised learning tasks are employed: object detection and action classification.
In the easiest variant of the object detection task, an image is given as an input and the system is asked to draw a bounding box around the objects appearing in the image (bounding box regression) and to assign a class label that describes the content of each bounding box (classification). At a very high level, there are two different approaches to this task. In one-step object detection, a bounding box is drawn and the corresponding label is assigned simultaneously, while in two-step object detection, those tasks are conducted sequentially (Jiao et al., 2019). Well-known representatives of one-step solutions are YOLO and SSD while there are various well-known two-step architectures like FasterRCNN, MaskRCNN, or EfficientDet. Without going into much detail, comparably modern one-step architectures are mostly faster at the task than two-step architectures but perform slightly worse in the classification part (Ouchra & Belangour, 2021).
Similarly, there is a huge set of deep learning architectures designed for the action classification task. In the easiest variant, an image is given, and the system needs to assign one unique class label out of a given set of labels (Lu & Weng, 2007). Prominent architectures are ResNet (He et al., 2016), EfficientNet (Tan & Le, 2019), or CoAtNet (Dai et al., 2021). The performance of such a classifier is measured by two important metrics: the accuracy as well as the fscore (Tharwat, 2021).
Suppose a sequence of n images is predicted and image i gets label s_i assigned by the classifier while its correct label, called ground-truth, was t_i. Suppose furthermore that classes 0, 1, …, k exist. Therefore, there are two sequences of labels [...] (See & Rahman, 2015).
A more classical approach toward employing the temporal dimension is the "multiple-frame encoding" (Franche & Coulombe, 2012; Ji et al., 2013) in which subsequent frames are merged into one image that is fed into the model. This approach allows capturing the temporal dimension even given a low frame rate, but it is inferior to more involved strategies as soon as the frame rate increases (Xu et al., 2016). This multiple-frame encoding will also be used in the present contribution, as the available video material is recorded with 1 frame per second.
[...] Wang et al. (2020), different metrics than the accuracy as target functions of this optimization process are discussed. While the performance on the training and validation data is of great theoretical interest, in applications, one is interested in the so-called generalization accuracy. To measure this accuracy, a third dataset of manually annotated data points is required, the test set. The important difference from the training and validation sets is that the images in the test set were not presented to the model during training and, therefore, the model's performance on these data is a good indicator of how well the model will perform in an application. It is well-known that the performance on the test set is better, the more similar the testing images are to the images presented during training. The discrepancy between the distribution of training images and testing images is called distribution shift. Machine learning models are known to be brittle even to small distribution shifts (Quiñonero-Candela et al., 2008) and, therefore, one tries to find a set of training images that represents the images in the application as well as possible.
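To make the two metrics concrete, the following minimal sketch computes the accuracy and a per-class f-score from a predicted label sequence and a ground-truth label sequence. It uses the textbook definitions (share of matching labels; harmonic mean of precision and recall per class), which is an assumption, since the exact formulas of the original text are not reproduced here. The function names and the example labels are hypothetical.

```python
def accuracy(pred, truth):
    """Share of images whose predicted label equals the ground truth."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(s == t for s, t in zip(pred, truth)) / len(pred)


def f_score(pred, truth, label):
    """F1 score of one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(pred, truth))
    tp = sum(s == label and t == label for s, t in pairs)
    fp = sum(s == label and t != label for s, t in pairs)
    fn = sum(s != label and t == label for s, t in pairs)
    denominator = 2 * tp + fp + fn
    # Convention chosen here: a class that never occurs scores 0.
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical labels: 0 = Standing, 1 = LHU, 2 = LHD.
    predicted = [0, 1, 1, 2, 2, 2]
    ground_truth = [0, 1, 2, 2, 2, 1]
    print(accuracy(predicted, ground_truth))
    print([f_score(predicted, ground_truth, c) for c in (0, 1, 2)])
```

Reporting the f-score per class alongside the accuracy is useful when classes are imbalanced, because a rare class contributes little to the accuracy.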

| BOVIDS
BOVIDS is an end-to-end software package which automatically [...]
It is important to emphasize that the following description is meant to present one possible way of using a deep-learning pipeline, starting from data preparation, over training and evaluation, and ending with the preparation of real data for statistical analyses. The deep learning models used perform well on testing sets and are known to be fast (Bochkovskiy et al., 2020; Tan & Le, 2019). The description is not meant to be the only possible way of implementing such a system. As will be shown, the system is easy to apply, and the results are satisfactory from a biologist's point of view.

| Overview
This section gives a short overview of BOVIDS' functionality. The system is designed to achieve a good performance in long-term studies using video recordings in enclosures. This includes observation of zoo animals as well as farm animal husbandry.
The goal is to tell the posture (Standing, LHU, LHD) of the observed animals at any time in the video with high precision to describe its fundamental behavior as well as possible.
Manual annotation of a video file of one night (14 h) by a trained person requires roughly two hours, which indicates that only a few video files out of a longer observation period can be evaluated manually. This is a challenge as one is confronted with two problems in designing a valid training set for a deep learning model. First, the postures Standing, LHU, and LHD are highly imbalanced such that, out of 14 h of video material, only a small portion can be easily used in a training set. It is of course possible to train on imbalanced data, but even this has limitations. Second, on different nights, the video recordings may vary due to changes in external conditions, like brightness or positioning of hay. Therefore, data recorded on different nights undergo a mild distribution shift. As manual annotation of many nights is very time-consuming and annotation of random periods of each night might cause an even more severe class imbalance, this contribution suggests an adaptation of a process called "offline hard example mining" (Felzenszwalb et al., 2010). This approach tries to minimize the human workload at the price of a higher computational cost in an iterative process. Miao et al. (2021) conducted an extensive study on such iterative processes and analyzed their performance with respect to deep-learning models that evaluate camera-trap images.
In the following section, a high-level sketch of the functionalities of BOVIDS is given and the details can be found in the subsequent sections. BOVIDS is divided into four components: BOV 1. Data collection, BOV 2. Object detection (OD), BOV 3. Action classification (AC), BOV 4. Data prediction.
While a part of BOV 4 is a significantly improved and extended version of work presented in an earlier contribution (Hahn-Klimroth et al., 2021), the newly developed components BOV 1-BOV 3 allow an interested user to apply the complete system comfortably to their own data. The software package consists of various small Python scripts that make it more convenient to handle large datasets and prepare the data in a way that can be used to apply the prediction pipeline BOV 4.
FIGURE 2 Visualization of the sequential application of the yolov4 object detector and the EfficientNet-B3 action classifier
FIGURE 3 Overview of the System BOVIDS and all its categories
BOV 1 makes it possible to convert video recordings directly from the LUPUS observation system. To annotate new data automatically, the prediction pipeline of BOVIDS (BOV 4) is used. The necessary scripts to prepare the training and validation set and to conduct the training are presented in BOV 2 for the object detector, while BOV 3 provides these functionalities with regard to the action classifier.
Furthermore, those sections contain a description of one possibility to fine-tune the models and achieve a good performance. Finally, multiple tools to measure the accuracy of the prediction and to detect systematic errors by BOVIDS are provided in BOV 4. Also, tools to represent and visualize the data that are a good starting point to apply further statistical methods are presented in this section. A visualization of the complete process is given in Figure 3.

| BOV 1: Data preparation
BOVIDS creates a collection of video files, one per night automatically if the data are recorded by the LUPUS observation system. If some data are missing due to power failure, the missing frames can be filled with a sequence of black frames to ensure a joint observation time over all video files. Such sequences of black frames will be labeled as Out by BOVIDS during prediction and, therefore, represent reality quite well.

| BOV 2: Training an object detector (OD)
The final object detector is trained following the subsequent procedure:
OD 1. Manual annotation of images.
OD 2. Train a first version of the object detector.
OD 3. Offline hard example mining (OHEM).
a. Automatic annotation of unseen data.
b. Evaluation of the suggested bounding boxes.
c. Retrain the deep neural network.
In the initial annotation task (OD 1), between 400 and 800 images are sampled stemming from multiple videos per enclosure over the observation period to increase the data variability. The number of images sampled in total depends on how much data there are overall, how difficult the detection appears to be, and whether individuals need to be distinguished. Those images are annotated manually by a freely available third-party software package called LabelImg (Tzutalin, 2015) and the initial training can be performed (OD 2). Hereby, 5% of the data is used as the validation set while 95% of the data is used for training.
To run an adapted version of the so-called "offline hard example mining" (Felzenszwalb et al., 2010), in short OHEM (OD 3), the object detector is used to automatically annotate 300-600 images out of unseen videos of the same set of enclosures (OD 3a). The quality of each such automatically drawn bounding box is evaluated. Hereby, a human assigns one out of four classes (good, okay, poor, swapped) to each bounding box (OD 3b) which is visualized in Figure 4. If the bounding boxes are satisfyingly accurate, the procedure stops at this point. Otherwise, the bounding boxes evaluated as poor or swapped are corrected manually using LabelImg. Those bounding boxes can be seen as "hard examples" as the current version of the object detector struggles at prediction. The freshly corrected annotations together with the well-evaluated bounding boxes are used to increase the training set of the object detector and the object detector is trained on this new, extended set. Again, 5% of the existing data is used for validation. This procedure can be repeated until satisfying results are achieved. In the conducted case study, one iteration sufficed to achieve a decent accuracy. After having trained an accurately working object detector, this object detector is one ingredient required to generate a training set for the action classifiers.
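One round of the OHEM bookkeeping described above can be sketched as follows. This is a minimal illustration rather than the BOVIDS implementation: the function names, the image identifiers, and the reading of "well-evaluated" as the classes good and okay are assumptions.

```python
import random


def split_by_rating(rated_boxes):
    """Separate human-rated boxes into accepted ones and hard examples.

    rated_boxes: list of (image_id, rating) pairs with rating in
    {"good", "okay", "poor", "swapped"}. Treating "good" and "okay" as
    accepted is an assumption made for this sketch.
    """
    accepted = [img for img, rating in rated_boxes if rating in ("good", "okay")]
    hard = [img for img, rating in rated_boxes if rating in ("poor", "swapped")]
    return accepted, hard


def train_val_split(items, val_fraction=0.05, seed=0):
    """Shuffle the items and hold out val_fraction of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    rated = [("img_a", "good"), ("img_b", "poor"),
             ("img_c", "okay"), ("img_d", "swapped")]
    accepted, hard = split_by_rating(rated)
    print(accepted, hard)
    # After the hard examples are corrected manually, both groups extend
    # the existing training data, which is split 95% / 5% again.
    extended = [f"old_{i}" for i in range(96)] + accepted + hard
    train, val = train_val_split(extended)
    print(len(train), len(val))
```

The fixed seed only keeps the sketch reproducible; in a repeated OHEM loop the split would be redrawn on the extended set in every round.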

| BOV 3: Action classification (AC)
The action classifier's goal is to predict the pose of an individual on a single image (single-frame, SF) to capture the spatial dimension of the video, respectively, on four subsequent images placed next to each other (multiple-frame, MF) to capture the temporal dimension.
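The multiple-frame encoding, in which four subsequent images are placed next to each other, can be sketched in a few lines. The sketch represents an image as a plain list of pixel rows to stay dependency-free, which is an assumption made for illustration; in practice an array library would typically be used. The function name is hypothetical.

```python
def encode_multiple_frames(frames):
    """Place four equally sized frames next to each other (left to right).

    frames: list of four images, each given as a list of rows, where a
    row is a list of pixel values.
    """
    if len(frames) != 4:
        raise ValueError("exactly four frames are expected")
    height = len(frames[0])
    if any(len(frame) != height for frame in frames):
        raise ValueError("all frames must have the same height")
    # Row r of the result is row r of frame 1, followed by row r of
    # frame 2, and so on.
    return [[pixel for frame in frames for pixel in frame[r]]
            for r in range(height)]


if __name__ == "__main__":
    # Four tiny 2 x 2 "frames" filled with the frame index.
    tiny = [[[i, i], [i, i]] for i in range(4)]
    for row in encode_multiple_frames(tiny):
        print(row)
```

The encoded image is four times as wide as a single frame, so a classifier that sees it can compare the animal's position and posture across the four time steps.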
The case study suggests that the following iterative process gen- [...]
When starting from scratch, it is most convenient to annotate the behavior of each single frame of a video by annotating the whole video (AC 1), for instance using the third-party software package BORIS (Friard & Gamba, 2016). In the conducted case study, video material corresponding to 170 nights was annotated manually, see Table A1. To generate the training set, equally many images (single-frame and multiple-frame encoded) of each posture (Standing, LHU and LHD) are extracted from the annotated video files by using the previously trained object detector. This balancing is one possible way to ensure that training of the action classifiers works decently (Japkowicz & Stephen, 2002). The reader should be aware that there are different strategies to deal with class imbalance that will not be discussed in this contribution. Due to the class balancing and the underrepresenta- [...]
Based on the human evaluation of the single images, it is now possible to retrain the action classifiers on a much broader dataset that really represents the distribution of the data that needs to be classified. At this point, the training classes might get slightly unbalanced if the human annotation deviates strongly from the automatic one. In this case, standard techniques like random upsampling might be considered (Branco et al., 2016) and are provided by BOVIDS; of course, different ways to deal with this imbalance can also be employed. Once a decent object detector and a well-performing action classifier are generated, all data can be predicted once more and the performance of BOVIDS can be measured.
At this point, we want to emphasize that training and validation data are generated as usual in machine learning for the object detection and the action classification tasks. However, generating a suitable testing set and choosing a decent evaluation metric is more involved, as the performances of the object detector and the action classifiers as single systems are subordinate to the outcome of their sequential application. This will be discussed in detail in Section 2.4.5.2.
FIGURE 4 Example of the four classes that can be given in evaluation, good (green), okay (yellow), bad (red), and swapped (blue)

| BOV 4: Data prediction
The data prediction step consists of three major parts (DP 1-DP 3) that are discussed in this section and are read as: [...]
In the object detection phase (P 1), the system first decomposes a video file into so-called "time intervals". This is a discretization of the continuous data into packages of seven seconds each.
More precisely, for each time interval the prediction pipeline collects four images. Then, the object detector is used to identify the animal present in the images or, respectively, to declare that no animal is present. While this step is governed by a Mask-RCNN network in Hahn-Klimroth et al. (2021), in the current version the architecture is changed to the much more recent yolov4 network as implemented by Taipingeric (2020) [...] additionally as one image in which these images are placed next to each other (multiple-frame encoded representation; Franche & Coulombe, 2012; Ji et al., 2013).
The subsequent step, the action classification phase (P 2) to determine the behavioral classes, is a classical image classification task.
For both the single- and multiple-frame representations, this task is governed by two independently trained EfficientNetB3 CNNs per time interval. The final prediction for any time interval is calculated as the average over both outcomes. Hahn-Klimroth et al. (2021) already describe that the "total classification" task (distinguishing Standing, LHU, LHD) might be much more difficult than the "binary classification" task (distinguishing Standing and Lying) and give the possibility to map the final prediction from LHU and LHD to Lying.
The approach of BOVIDS toward this binary task is slightly different.
It is necessary to train a set of independent networks that purely govern this binary classification such that possible features can eventually be better learned.
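The averaging of the single-frame and the multiple-frame classifier can be illustrated as below. The sketch assumes that each network outputs one probability per posture and that "average" means the element-wise mean of these vectors; both points, as well as the function name and the example numbers, are assumptions.

```python
POSTURES = ("Standing", "LHU", "LHD")


def average_prediction(p_single, p_multi):
    """Average two class-probability vectors and pick the most likely posture.

    p_single: output of the single-frame classifier for one time interval.
    p_multi:  output of the multiple-frame classifier for the same interval.
    """
    if len(p_single) != len(POSTURES) or len(p_multi) != len(POSTURES):
        raise ValueError("one probability per posture is expected")
    averaged = [(a + b) / 2 for a, b in zip(p_single, p_multi)]
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return POSTURES[best], averaged


if __name__ == "__main__":
    # Hypothetical outputs: the two networks disagree on this interval.
    label, averaged = average_prediction([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
    print(label, averaged)
```

For the binary task, the same function would be applied to the outputs of the separately trained binary networks, with the label tuple replaced accordingly.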
To reduce classification flickering, Hahn-Klimroth et al. (2021) propose [...]
In the present work, the chosen set of postprocessing rules varies significantly between the binary and the total classification task. Indeed, as the binary classification task is meant to study longer periods of Standing and Lying, phases of up to 5 min are dismissed. Furthermore, in the total classification task, a distinction is made between adult common elands and nonadult common elands as the latter show shorter phases than the adult individuals. A detailed overview of the postprocessing rules used can be found in Table A2.
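The idea of dismissing short phases can be sketched as follows. The actual rule set is the one in Table A2; the rule used here, absorbing every short phase into the phase that precedes it, is a simplifying assumption, and the function and parameter names are hypothetical.

```python
from itertools import groupby


def dismiss_short_phases(labels, max_short_len):
    """Absorb phases of at most max_short_len intervals into the previous phase.

    labels: one label per time interval, in temporal order.
    A short phase at the very beginning is kept, since there is no
    preceding phase that could absorb it.
    """
    phases = []  # list of [label, length] in temporal order
    for label, group in groupby(labels):
        length = len(list(group))
        if phases and length <= max_short_len:
            phases[-1][1] += length      # dismiss: extend the previous phase
        elif phases and phases[-1][0] == label:
            phases[-1][1] += length      # rejoin two phases of the same label
        else:
            phases.append([label, length])
    return [label for label, length in phases for _ in range(length)]


if __name__ == "__main__":
    night = ["S"] * 5 + ["L"] * 2 + ["S"] * 4 + ["L"] * 6
    print("".join(dismiss_short_phases(night, max_short_len=2)))
```

If every seven-second interval carries one label, a threshold of 5 min corresponds to roughly 300 // 7 = 42 intervals.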

DP 2: Data evaluation
As the prediction of a deep learning-based system works, in the end, as a black box, it is very important to assure the quality of the prediction regarding all quantities of interest. Therefore, it is crucial to define a valid testing set and appropriate evaluation metrics. Because of the iterative process by which the training set was built, the images used for training the action classifiers are an almost uniform sample from the whole observation period. Thus, any specific video is an adequate sample to determine the expected accuracy, which implies that a good testing set is given by the already manually annotated videos. Observe that during training only the object detector and the action classifiers as single systems were evaluated with respect to a validation set; ultimately, it is more important that the prediction of a complete video is accurate with respect to biologically interesting quantities.
To quantify the accuracy of the prediction on the testing set, performance indicators from machine learning theory as well as biological key figures are evaluated by the following four quality criteria.
QC 1. Analysis of the object detector per night ("detection density").
QC 2. Accuracy and f-score as well as a comparison of the number of phases, the median phase length, and the overall percentage per activity class between BOVIDS' prediction and the manual annotation.
QC 3. Number, length, and type of misclassified sequences.
QC 4. Search for apparent outliers in the plotted phases of each predicted night.
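The machine learning indicators of QC 2 can be computed from two label sequences of equal length, one from the manual annotation and one from the prediction. The sketch below assumes that "f-score" refers to the usual F1 score per class; the label sequences are invented, and the codes "A" (Standing), "L" (LHU), and "S" (LHD) follow the abbreviations of the postprocessing rules.

```python
from typing import Sequence


def accuracy(truth: Sequence[str], pred: Sequence[str]) -> float:
    """Share of time intervals with identical labels."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def f_score(truth: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score of one class (assumption: F1 is the f-score meant)."""
    tp = sum(t == label and p == label for t, p in zip(truth, pred))
    fp = sum(t != label and p == label for t, p in zip(truth, pred))
    fn = sum(t == label and p != label for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up annotation and prediction for eight time intervals.
manual = ["A", "A", "L", "L", "S", "S", "L", "A"]
predicted = ["A", "A", "L", "S", "S", "S", "L", "A"]
print(accuracy(manual, predicted))      # 0.875
print(f_score(manual, predicted, "S"))  # 0.8
```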

| BOVIDS' performance in the case study
This section is devoted to reporting the validity of postprocessing rules and the quality criteria QC 1-QC 4 in the case study.
Subsequently, in the next section, the nocturnal behavior of the common elands is presented.

FIGURE 5 Example of one night of one common eland with the plotted phases of the three behavioral states of the total system given by BOVIDS to look for quality criterion QC 4

… while its median turns out to be 99.8% with a 0.25-quantile of 97.5% and a 0.75-quantile of 100%. To analyze the last quality criterion (QC 4), namely, to look for apparent outliers, BOVIDS creates one plot per predicted night (for the binary and for the total classification task, respectively) representing the temporal course of the behavioral phases (see Figure 5). There are few apparent outliers in data that were not manually labeled, and the automatic annotation was checked on randomly selected nights. In most cases, it was found that BOVIDS' prediction is correct even if it seemed suspicious. The misclassifications observed during this step fit exactly the description of the errors in QC 3, and their frequency is comparable.

| The nocturnal behavior of common elands
The data presentation tools of BOVIDS give a first visual result regarding the relative distribution of the behavioral states, their means over all nights, and the rhythm of phases of behavioral states (see …) … Nevertheless, the longest observed phase of LHD was shown by the nonadult individuals (47.8 min), followed by the male adults (32.9 min) and the female adults (14.8 min).
Besides the length of the phases, the number of phases per night is also an interesting parameter. Figure 10 visualizes the number of Lying phases (binary classification system) as well as the number of LHD phases (total classification system). Note that the number of Standing phases equals the number of Lying phases ±1. The above illustration highlights the different age categories of young, subadults, and adults, with sex being distinguished in the adult category.
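The ±1 relation between the number of Standing and Lying phases follows from the alternation of the two states, as a small run-length sketch shows. It assumes a night that contains only the two binary states (no Out); the label string is invented, and phase lengths are given in time intervals.

```python
from itertools import groupby
from statistics import median
from typing import List, Sequence, Tuple


def phases(labels: Sequence[str]) -> List[Tuple[str, int]]:
    """Run-length encode a label sequence into (state, length) phases."""
    return [(state, len(list(run))) for state, run in groupby(labels)]


def count_phases(labels: Sequence[str], state: str) -> int:
    """Number of phases of one behavioral state."""
    return sum(1 for s, _ in phases(labels) if s == state)


def median_phase_length(labels: Sequence[str], state: str) -> float:
    """Median phase length of one state, measured in time intervals."""
    lengths = [n for s, n in phases(labels) if s == state]
    return median(lengths) if lengths else 0.0


# Made-up binary night: "A" = Standing, "L" = Lying.
night = "AAA" + "LLLLL" + "AA" + "LLLL" + "A"
print(count_phases(night, "A"), count_phases(night, "L"))  # 3 2
print(median_phase_length(night, "L"))                     # 4.5
```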
The number of Lying phases (see Figure 10(a)) appears to be constant across individuals, and differences between sex and age groups are not evident. The situation is different when it comes to LHD, where the young animals have a significantly higher number of phases than the adults. The subadults tend to have slightly more LHD phases than the adults, but they are already closer to the values of the adults than to those of the young.

… However, this happens only very moderately, by a factor of between 5.6% (Standing) and 7.5% (LHD). It will be seen later that the methodological error slightly underestimates those quantities with respect to the postprocessed data. Therefore, the errors partly offset each other.
The object detector seems to work very well (QC 1), as the median object detection density is very high. On nights with a lower detection density, the video material was checked manually, and it was observed that, whenever the object detector did not find the individuals, they were mostly Out or only small parts of them were visible at the border of the video recording.
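How such a detection density could be computed per night is sketched below. The sketch rests on the assumption that the detection density is the share of images of a night in which the object detector finds the animal; this definition and the detection results are illustrative and not taken from the BOVIDS implementation.

```python
from statistics import median
from typing import Sequence


def detection_density(detected: Sequence[bool]) -> float:
    """Share of images in which the detector found the animal (assumed definition)."""
    if not detected:
        raise ValueError("need at least one image")
    return sum(detected) / len(detected)


# Made-up detection results for three nights with ten images each.
nights = [
    [True] * 9 + [False],
    [True] * 10,
    [True] * 8 + [False] * 2,
]
densities = [detection_density(n) for n in nights]
print(densities)          # [0.9, 1.0, 0.8]
print(median(densities))  # 0.9
```

Nights with a low density are natural candidates for the manual check described above.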
Subsequently, quality criteria QC 2 and QC 3 are discussed. Since the number of phases per activity class and the phase length analysis refer to Standing and Lying from the binary classification task as well as LHD from the total classification task, the discussion focuses on the reliability of these quantities. Overall, the accuracy and the f-score of BOVIDS' prediction are very high for machine learning- … This has two reasons. First, and most importantly, the object detector used seems to be able to discriminate between two individuals very accurately. Second, the action classifier seems to be very robust against truncation effects when, for example, the bounding boxes of the two animals overlap.
In summary, the activity budget per night is predicted without any bias, as expected, while the median phase length per activity class is overestimated due to postprocessing rules by a moderate factor of no more than 7.0%. Thus, the automatic prediction is very precise with respect to the postprocessed data. Furthermore, the overall accurate description of the three poses Standing, LHU, and LHD by BOVIDS can be seen in connection with the types of misclassifications occurring on the testing data. All misclassifications between Out and a real activity class are due to heavy truncation or occlusion effects in which a human annotator might see hooves or small parts of the animal and is able to safely infer the behavior, but a machine cannot. In this case, it is favorable if the object detector does not find the animal in the first place. Furthermore, almost all misclassifications between LHU and LHD can be explained by the fact that common elands show, from time to time, a grooming behavior at their hind leg which is, on a single image, hard to distinguish from LHD. Such errors need, of course, to be considered and analyzed, but they do not seem to be fixable by more training data or by fine-tuning the networks if the input data format does not change significantly. As mentioned earlier, the median phase length as well as the median number of phases per night are only slightly overestimated. In the binary classification task, there are some short misclassifications, less than five minutes in length, with respect to the postprocessed data. These errors are just delayed transitions between the behavioral states due to, for instance, the rolling average applied during postprocessing. Therefore, these misclassifications influence neither the number of phases of Standing and Lying nor the animal's rhythms, but only slightly change the total duration of a specific phase. Finally, there are a few misclassifications that are probably unavoidable in a deep learning classification task.
Of course, accuracy can, in principle, always be improved by additional rounds of example mining and fine-tuning the action classifiers, but it is questionable whether a median accuracy even higher than 99.4% can be reached on a three-class classification task.
A natural question, of course, is how well the findings from the test series can be generalized to unseen data of the same enclosures.
Recall that the action classifiers are, in the end, trained on a random collection of images over the whole observation time due to offline hard example mining. Therefore, the testing set can be considered an almost random sample which includes a few more difficult images than expected in a random balanced sample. Thus, the analysis of the performance on the manually annotated nights (the testing set) yields a very good approximation of the overall performance.
This claim is also supported by the analysis of QC 4. The type and frequency of errors on randomly selected, nonmanually annotated nights were found to be comparable to those in the test set.
Finally, even if BOVIDS makes a small number of mistakes that would not occur if a trained observer manually annotated the data, the very large dataset overcompensates those few errors. Another approach to generating a large dataset is to have different, probably untrained, human observers annotate a comparable number of nights. Apart from the much higher cost, it is supposed that the interobserver reliability might be worse than the reliability of BOVIDS.
Overall, our findings show that BOVIDS performs very accurately in the case study and its predictions can be safely used to generate a large amount of annotated data, which would not have been easily possible without automation.

| Challenges and limitations
As for any deep learning-based classifier, there are various challenges to overcome during fine-tuning the underlying model. Even after extensive fine-tuning, there will be cases in which the system fails. While the last paragraph already discussed that small errors are overcompensated by evaluation of much data, this section is devoted to exploring typical misclassifications that arise if BOVIDS is used.
A major challenge is given by highly truncated sequences of video material. In many applications, it is not possible to install cameras in a way that allows recording of every edge of the enclosure. This can cause misclassifications during action classification.
Indeed, if only small parts of an animal can be seen, like only its hooves or its head, and the object detector draws a decent bounding box, it is hard or even impossible, even for trained humans, to classify the behavior. To overcome this issue, it is possible to classify bounding boxes that are close to the image's border by a deterministic rule. A natural choice might be Out, but in special cases one might use information about the recorded enclosure to infer the behavior in the truncated area. In-depth observation of one's own data is necessary to identify those regions of the enclosure in which severe truncation effects might occur and to define proper rules on how to deal with them.
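A deterministic border rule of the kind described above might look as follows. This is only a sketch: the margin of 10 pixels, the image size, and the names `Box` and `apply_border_rule` are hypothetical, and a real rule would have to be adapted to the enclosure at hand.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Bounding box in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def near_border(box: Box, img_w: float, img_h: float, margin: float) -> bool:
    """True if the box lies within `margin` pixels of any image border."""
    return (box.x_min <= margin or box.y_min <= margin
            or box.x_max >= img_w - margin or box.y_max >= img_h - margin)


def apply_border_rule(box: Box, predicted: str, img_w: float, img_h: float,
                      margin: float = 10.0, border_label: str = "Out") -> str:
    """Replace the classifier's label by a fixed label for boxes at the border."""
    return border_label if near_border(box, img_w, img_h, margin) else predicted


# A truncated animal at the left border and a fully visible one (640x480 image).
print(apply_border_rule(Box(0, 300, 40, 480), "LHD", 640, 480))     # Out
print(apply_border_rule(Box(200, 150, 420, 330), "LHD", 640, 480))  # LHD
```

Using enclosure-specific knowledge would amount to choosing `border_label` per region instead of the global default.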
Another challenge arises if the animal is not present in a sequence of images. In this case, it is possible that an object in the enclosure, like a trough, is falsely classified as an animal. This issue can be addressed by more training steps of the object detector or by increasing the so-called minimum confidence score: an object detector does not only suggest a bounding box and a class label but also returns a confidence score between zero and one. If the threshold for this value is set near one, misclassifications are expected to be very rare, but the bounding boxes of animals are also more easily discarded. A good threshold depends highly on the application and should, therefore, be determined by testing.

… to the whole image, such a description becomes meaningless. But overall, we believe that in many enclosures this approach can be implemented within the current deep learning system and can deliver more information on ungulates' behavior.
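The trade-off behind the minimum confidence score discussed above can be made explicit with a short sketch. The detections, scores, and thresholds are invented, and the `Detection` type is a hypothetical stand-in for the output of an object detector.

```python
from typing import List, NamedTuple, Optional, Tuple


class Detection(NamedTuple):
    """Hypothetical detector output: class label, confidence, bounding box."""
    label: str
    score: float
    box: Tuple[int, int, int, int]


def best_detection(detections: List[Detection],
                   min_confidence: float) -> Optional[Detection]:
    """Keep detections above the threshold and return the most confident one."""
    kept = [d for d in detections if d.score >= min_confidence]
    return max(kept, key=lambda d: d.score) if kept else None


animal = Detection("eland", 0.97, (120, 80, 400, 360))
trough = Detection("eland", 0.55, (500, 300, 600, 380))  # false detection

# A moderate threshold removes the trough but keeps the animal.
print(best_detection([animal, trough], 0.9) == animal)  # True
# In an empty enclosure, the trough alone is discarded (interval becomes Out).
print(best_detection([trough], 0.9))                    # None
# A threshold too close to one also discards the real animal.
print(best_detection([animal, trough], 0.99))           # None
```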
Furthermore, it should be discussed whether the iterative process used to create a reasonable training set could be improved. The degree of automation of the system at hand resembles that of a "machine-assisted" evaluation of video material more than that of an autonomous deep learning system. Such iterative processes to obtain reliable machine learning models are extensively studied in a recent publication by Miao et al. (2021) using the example of camera-trap images. The findings of the aforementioned publication as well as the findings of the current paper indicate that such a partly auto- … The explicit implementation of these classifiers is independent of this approach and, therefore, it might be tempting to conduct comparative studies regarding the performance of different recent deep learning architectures within the proposed system.

| The nocturnal behavior of common elands
A first finding is that, independent of the factors age, sex, and keeping zoo, all individuals exhibit a higher percentage of Lying than Standing during the night. As the night progresses, the percentage of Lying increases significantly. This is in line with findings of similar studies on African elephants (Loxodonta africana), blue wildebeest (Connochaetes taurinus), or Arabian oryx (Oryx leucoryx), where the observed animals also show most of the sleeping behavior or inactivity in the second part of the night (Davimes et al., 2018; Gravett et al., 2017; Malungo et al., 2021).
When considering LHD, it should be noted that this posture most likely corresponds to the typical REM (rapid eye movement) sleep posture. As mentioned in the ethogram section, a behavioral component to recognize REM sleep is the head being down due to postural atonia (Lima et al., 2005; Zepelin et al., 2005). In this study, we use this characteristic REM sleep posture to determine REM sleep. This approach is in line with the study by Zizkova et al. (2013) on common elands and the study by Ternman et al. (2014) on cows, which shows that REM sleep can be detected with sufficient certainty based on behavioral surveys. This procedure is also supported by a study on lesser mouse-deer (Tragulus kanchil), which shows that REM sleep can be divided into two categories, one of which is the most common, where the head lies down most of the time, making this a valid indicator to recognize REM sleep in behavioral studies (Lyamin et al., 2021).
Sex has been found to have an influence on the total amount of LHD during the night. The REM sleep periods of adult males last slightly longer than those of adult females, a fact which is also known across multiple phylogenetic states, for birds and mammals (Cajochen et al., 2006; Rattenborg et al., 2017; Steinmeyer et al., 2010). However, other studies show that there are no sex differences when the sexes are of similar size, while sexes of dissimilar size should show differences (Ruckstuhl & Kokko, 2002). In common elands, males are larger than females (Leslie, 2011; Myers et al., 2021), which is consistent with the differences found between the sexes. In addition, Standing was found to increase with age. Interestingly, this finding is supported by the recording of a male individual at both subadult and adult age, which shows a significant increase in the total amount of Standing per night.
Our results are in line with previous results on different mammals, as age is known to be an influencing factor for activity/rest cycles (Ruckstuhl & Neuhaus, 2009; Siegel, 2005; Steinmeyer et al., 2010). Moreover, age also influences REM sleep behavior in mammals and birds (Cajochen et al., 2006; Rattenborg et al., 2017; Ruckstuhl & Kokko, 2002; Steinmeyer et al., 2010). This effect was also observed in the common elands in this study, where the extent of LHD differs between the three age classes: young, subadults, and adults. A study on giraffes (Giraffa camelopardalis) also shows that age and sex have an influence on the behavior Standing, while only age has an influence on REM sleep (Burger et al., 2021). The study by Burger et al. (2021) further reveals that housing conditions can be discarded as an influencing factor for both behaviors.
These results correspond to the results of this study on common elands, where the keeping zoo and thus housing conditions can also be discarded as influencing factors. Of course, the factor housing condition consists of several factors such as, among others, enclosure size, the presence or absence of other types of animals in the vicinity, and lighting conditions. While the recorded data do not allow each possible influencing factor to be evaluated individually, our study reveals that the sum of those effects is negligible and can be discarded.
Besides the total amount of time during the night, the duration of the single phases is also of interest. Here, the males differ from the females within Lying, with males showing longer Lying phases than females. This fits with the result that adult males have a higher amount of LHD. Age also has an influence on the lengths of the phases. The nonadult animals exhibit shorter periods of Standing and longer periods of Lying than the adult ones. This also matches the results regarding the nocturnal activity budgets. Within LHD, the number of phases varies between the different categories of individuals. The mean phase length of LHD in all adult common elands is 9.5 min, with females slightly below this at 8.8 min and males slightly above at 10.2 min. These phase lengths are consistent with those of male Arabian oryx (Oryx leucoryx), which have a mean phase length of 7 ± 2 min in the dark in winter, and 10.5 ± 1.5 min over the 24-h cycle (Davimes et al., 2018). This again indicates differences between the sexes and high similarities within each group of individuals. Again, it seems that certain rhythms are present that depend on sex and age but are independent of the specific individual. This observation might be the starting point of a much more detailed analysis of rhythms in African ungulates' behavior.

ACKNOWLEDGMENTS
We thank the Opel-Zoo Foundation Professorship in Zoo Biology of the von Opel Hessische Zoostiftung who funded the research leading to the results. The authors gained great support from directors and curators of the participating zoos (in alphabetical order): Allwetterzoo Münster, Erlebnis-Zoo Hannover, Opel-Zoo Kronberg, Zoo Dortmund, Zoom Erlebniswelt Gelsenkirchen. Also, the animal keepers of these zoos promoted this study mainly by assisting and allowing the data collection and providing information about the animals. Isabel Seyrling and Franziska Zölzer assisted in the installation of the cameras. Finally, we thank the two unknown reviewers for their detailed reading of the manuscript and their valuable remarks that helped to improve the paper significantly. Open Access funding enabled and organized by Projekt DEAL.

CONFLICT OF INTEREST
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

… in between a phase of behavior X and behavior Z, then the event will be dismissed (marked as X) if it consists of fewer time-intervals than indicated by the minimum length of XYZ. In those codes, Standing is abbreviated to "A," LHU to "L" and LHD to "S" in the total classification task. In the binary classification task, "A" means Standing and "L" means Lying. "O" stands for Out in both tasks. *X* is meant to be read as any combination YXZ where Y and Z do not equal X. The applied rules of dismissing short phases can be found in Table A2.
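The rule for dismissing short phases can be sketched on a string of the codes introduced above. The sketch is one possible reading of the rule: the minimum lengths passed in are placeholders and not the values of Table A2, the single left-to-right pass is an assumption, and the function name is hypothetical.

```python
from itertools import groupby
from typing import Dict


def dismiss_short_phases(labels: str, min_len: Dict[str, int]) -> str:
    """Mark a phase Y between X and Z as X if it is shorter than min_len of XYZ.

    Keys of `min_len` are triples such as "ALA" or wildcards such as "*S*".
    """
    runs = [[state, len(list(run))] for state, run in groupby(labels)]
    i = 1
    while i < len(runs) - 1:
        x, y, z = runs[i - 1][0], runs[i][0], runs[i + 1][0]
        limit = min_len.get(x + y + z, min_len.get("*" + y + "*", 0))
        if runs[i][1] < limit:
            runs[i - 1][1] += runs[i][1]  # the short phase is marked as X
            del runs[i]
            if runs[i][0] == x:           # merge with a following X phase
                runs[i - 1][1] += runs[i][1]
                del runs[i]
        else:
            i += 1
    return "".join(state * n for state, n in runs)


# Placeholder rule: an LHU phase between two Standing phases needs 3 intervals.
print(dismiss_short_phases("AAAAALLAAAAASSSSSS", {"ALA": 3}))
# AAAAAAAAAAAASSSSSS
```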

AUTHOR CONTRIBUTIONS
Regarding the special state Out, the post-processing rules are a bit more elaborate. If flickering between Out and a real behavioral state occurs, this is very likely due to a failure of the object detector when an animal is occluded or truncated. Therefore, if a sequence of a specific behavioral state X (Standing, Lying, LHU or LHD) is interrupted by phases of Out, the Out phases are dismissed under the following conditions. First, each single phase of Out must be shorter than 27 time-intervals (total) or 135 time-intervals (binary). Second, the total percentage of X in the sequence needs to exceed 20%.
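The two conditions for dismissing Out phases can be written down as a short sketch. It assumes that a "sequence" is a maximal stretch that starts and ends with the same state X and contains only X and short Out phases; this reading, like the function name, is an assumption and not taken from the BOVIDS code. The defaults 27 and 20% are the values stated above for the total classification task.

```python
from itertools import groupby


def dismiss_out_interruptions(labels: str, max_out: int = 27,
                              min_share: float = 0.2) -> str:
    """Relabel short Out ("O") phases inside a sequence of state X as X."""
    runs = [(state, len(list(run))) for state, run in groupby(labels)]
    result = []
    i = 0
    while i < len(runs):
        x = runs[i][0]
        j = i
        # Condition 1: extend only over Out phases shorter than max_out
        # that are followed by the same state X again.
        while (x != "O" and j + 2 < len(runs) and runs[j + 1][0] == "O"
               and runs[j + 1][1] < max_out and runs[j + 2][0] == x):
            j += 2
        seq = runs[i:j + 1]
        total = sum(n for _, n in seq)
        x_total = sum(n for state, n in seq if state == x)
        # Condition 2: X must make up more than min_share of the sequence.
        if j > i and x_total / total > min_share:
            result.append(x * total)
        else:
            result.extend(state * n for state, n in seq)
        i = j + 1
    return "".join(result)


print(dismiss_out_interruptions("AAAOOAAA", max_out=3))  # AAAAAAAA
```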

TABLE A2
Overview of the minimum length a specific behavioral phase needs to have in order not to be dismissed in the post-processing step